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Abstract 

Background: Chromatin immunoprecipitation combined witin massive parallel sequencing (ChlP-seq) is widely 
used to study protein-chromatin interactions or chromatin modifications at genome-wide level. Sequence reads 
that accumulate locally at the genome (peaks) reveal loci of selectively modified chromatin or specific sites of 
chromatin-binding factors. Computational approaches (peak callers) have been developed to identify the global 
pattern of these sites, most of which assess the deviation from background by applying distribution statistics. 

Results: We have implemented MeDiChlSeq, a regression-based approach, which - by following a learning process - 
defines a representative binding pattern from the investigated ChlP-seq dataset. Using this model MeDiChlSeq 
identifies significant genome-wide patterns of chromatin-bound factors or chromatin modification. MeDiChlSeq has 
been validated for various publicly available ChlP-seq datasets and extensively compared with other peak callers. 

Conclusions: MeDiChl-Seq has a high resolution when identifying binding events, a high degree of peak- 
assessment reproducibility in biological replicates, a low level of false calls and a high true discovery rate when 
evaluated in the context of gold-standard benchmark datasets. Importantly, this approach can be applied not only 
to 'sharp' binding patterns - like those retrieved for transcription factors (TPs) - but also to the broad binding patterns 
seen for several histone modifications. Notably, we show that at high sequencing depths, MeDiChlSeq outperforms 
other algorithms due to its powerful peak shape recognition capacity which facilitates discerning significant binding 
events from spurious background enrichment patterns that are enhanced with increased sequencing depths. 
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Background 

Chromatin immunoprecipitation (ChIP) combined with 
high throughput sequencing is widely used for character- 
izing the genome-wide association pattern of chromatin- 
interacting factors and histone or DNA modifications, 
for which selective tools for affinity purification, mostly 
antibodies, exist. While ChlPed DNA was first analysed 
at genome-wide level by hybridization to genomic tiling 
arrays (also known as ChlP-on-chip or ChlP-chip), dir- 
ect sequencing is generally used these days (referred to 
as ChlP-seq). Massive parallel sequencing has overcome 
several limitations of the array-based (ChlP-chip) ap- 
proach; such as spatial resolution, signal-to-noise ratio, 
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dye and the probe-dependent hybridization biases and 
costs (for a detailed comparison of the two approaches 
see [1]); thus ChlP-seq is becoming the method of 
choice for mapping protein-chromatin interactions and 
chromatin modifications at global level. 

Irrespective of whether ChlP-chip or ChlP-seq is used, 
the aim of the corresponding data analysis is to identify 
patterns in the reconstructed signal profiles that reflect 
the bona fide enrichment of the factor/modification of 
interest across the entire genome. Several pattern recon- 
struction methodologies have been described to date 
using approaches based on different concepts to define 
what constitutes an enrichment event or peak. The sim- 
plest concept defines an enrichment region based on a 
user-chosen read count intensity threshold [2,3]. Other 
methodologies evaluate background levels from control 
(non-enriched) datasets to assess enrichment confidence 
p-values in the chromatin immuno-precipitated (ChIP) 
profile from a binomial distribution model [4,5]. In the 
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same manner, when no control samples are available, the 
background is usually estimated from a Poisson distribu- 
tion model [4,6,7]. 

In the last years another group of peak callers was de- 
veloped which use the signal enrichment dependency in 
a spatial context to discover significant binding events 
[8-11]. Importantly, this new family of peak callers de- 
fines significant binding events from the consecutive 
behaviour of enriched and non-enriched regions by ap- 
plying Hidden Markov models (HMM), thus assessing 
its significance from enrichment properties rather than 
describing only differences relative to the background. 

Finally, a new generation of peak callers exploits the 
properties of expected binding patterns. Among them, 
PeakRanger complements the use of the background 
modelling by using in a second round a "summit-valley- 
alternator" algorithm to scan for significant summits 
[12]. Others assess the shape of the observed binding 
patterns either by applying topological tree-based statis- 
tics [13], or by elucidating properties of the forms asso- 
ciated to the enrichment profiles [14]. 

Here we introduce MeDiChlSeq, a model-based de- 
convolution approach, originally developed to evaluate 
ChlP-chip profiles [15]. Importantly, MeDiChlSeq takes 
advantage of the shape of the binding event itself as a re- 
source for identifying them in an accurate manner; thus 
by providing a higher power of discrimination between 
true binding events and artifactual read-count enrich- 
ment patterns. MeDiChlSeq computes a model from a 
selected subset of the multiple binding events that con- 
stitute a genome-wide profile; then, this model is used as 
a deconvolution kernel to predict global binding/modifi- 
cation events, which are further validated by applying a 
non-parametric bootstrapping approach. The perform- 
ance of MeDiChlSeq has been compared with various 
other peak callers that are representative of the different 
approaches currently used to define significant binding 
events in ChlP-seq profiles. 

Implementation 

MeDiChlSeq is based on MeDiChI, a model-based 
deconvolution method designed for the analysis of 
ChlP-chip datasets [15] to which an important number 
of novel implementations have been added to enable the 
analysis of datasets generated by massive-parallel se- 
quencing. Specifically, while the regression-based calcu- 
lator embedded on MeDiChI is essentially the same in 
MeDiChlSeq (see Additional file 1), the major novel 
implementations incorporated for transforming MeDi- 
ChI into a Peak caller dedicated to the analysis of ChlP- 
Seq datasets comprise (1) the preprocessing of mapped 
sequence files to generate read-count intensity files com- 
patible with MeDiChI readout; (2) the enhancement of 
the peaks' confidence assessment by including local and 



global background comparisons as well as the use of 
input control datasets when available and (3) the 
implementation of a multicore processing structure to 
accommodate computation requirements observed when 
MeDiChI was applied to larger genomes than those that 
have been used for its release. These novel implementa- 
tions are described below in more detail. 

ChlP-seq datasets 

MeDiChlSeq processes mapped sequence files in differ- 
ent formats (e.g. BED, BAM). Read-count intensity 
profiles are reconstructed from mapped read files by 
elongating each read to a user-defined length (default 
read elongation: 150nt) and counting the elongated read 
overlaps within a defined window (default wiggle-format 
files resolution: lOnt). While the read elongation param- 
eter is generally provided by the user, we have incorpo- 
rated in MeDiChlSeq a function that predicts a suitable 
read elongation from the information retrieved in the 
ChlP-seq profile itself (see Additional file 2 and below). 

In ChlP-chip the reconstructed signal intensity is gen- 
erated by comparing the immunoprecipitated informa- 
tion (IP) with the control dataset, while in ChlP-seq the 
IP and control datasets are processed separately. There- 
fore, MeDiChlSeq takes as an additional file (when avail- 
able) a control dataset for improving the confidence 
assessment of the identified binding events (see below). 

Establishing a representative binding pattern by applying 
an iterative learning process 

One of the main advantages of MeDiChlSeq is its 
capacity of inferring a representative binding pattern 
(referred to as "kernel") from the provided ChlP-Seq 
dataset. As illustrated in the Additional file 3, this is per- 
formed by fitting a binding pattern model to a reduced 
number of genomic regions, which are selected by apply- 
ing a read-count intensity cutoff criterion. This cutoff 
can be defined as a given read-count intensity or by a 
quantile intensity parameter. 

Model fitting is performed in an iterative manner by 
evaluating in each round the number of peaks that fit 
the best to the current model and adjusting its parame- 
ters (shape and scale of Gamma distribution) by minim- 
izing the regression residuals. The formalism of this 
procedure is extensively described in [15] and its imple- 
mentation for ChlP-seq datasets is detailed in the 
Additional file 1. 

Sequenced reads elongation parameter inferred from 
ChlP-seq strand-specific information 

In ChlP-seq assays the reconstruction of factor binding/ 
chromatin modification profiles is currently performed 
by applying a computational elongation of the sequenced 
reads prior read-count intensities assessment. This 
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elongation step is performed because each sequenced 
read corresponds to the 5'-ends of the fragmented 
immunoprecipitated chromatin. Importantly, the appUed 
elongation distance, which corresponds generally to 
the fragmented chromatin length, is important for 
proper assignment of a factor binding site or chromatin 
modification. 

While the read elongation parameter is generally pro- 
vided by the user (based on the experimental condi- 
tions), we have incorporated in MeDiChlSeq a function 
that could predict a suitable read elongation from the in- 
formation retrieved in the ChlP-seq profile itself In fact, 
while previous studies assessed the read elongation dis- 
tance by evaluating the distance between the forward 
and reverse enriched reads [7], MeDiChlSeq applies the 
iterative linear regression model fitting in a strand- 
specific manner without read elongation. This prelimin- 
ary step infers the DNA fragment length per strand 
(ideally both strand-specific fragment lengths are the 
same); subsequently these are combined to define the 
read elongation parameter (see Additional file 2). 

Genome-wide identification of significant binding events 
by using the modeled representative binding pattern 
(l<ernel) 

Binding site identification by MeDiChlSeq is based on 
the assumption that all binding patterns associated to a 
given immunoprecipitation assay might present similar 
peak shape characteristics. Thus, the representative 
binding pattern or kernel modeled by the iterative re- 
gression approach is used to deconvolve binding events 
over the entire ChlP-seq profile. For this, the dataset is 
subdivided into overlapping windows (default parameter 
20,000 nt window length; a contiguous window overlap 
is defined to cover at least one peak length) and the pre- 
computed kernel is used to identify those enrichment 
patterns that fit best. 

Like in the case of MeDiChI, the likelihood of an enrich- 
ment event to match the trained kernel is related to the 
ChlP-seq background and is estimated by applying a 
non-parametric bootstrap approach [15]. MeDiChlSeq 
compares for this purpose the putative binding sites iden- 
tified by kernel fitting with the "kernel-fitting residuals" 
(i.e., those not complying with the model, and correspond- 
ing to the background). Moreover, these residuals are fur- 
ther deconvolved to identify potential patterns that would 
match with the operative kernel despite their possible 
background characteristics. Finally, each putative binding 
site is compared with its surrounding background in a 
local (default size of this centered surrounding windows is 
5,000; 10,000 and 15,000nt) and global (genome-wide 
background) context. 

The use of three different window sizes facilitates clas- 
sifying the surrounding of potential binding sites as 



background. MeDiChlSeq provides to each identified 
binding site local confidence p-values for all three evalu- 
ated windows and a global p-value. To provide an over- 
all confidence estimate based on both global and local 
p-values, these descriptors were combined into a single 
confidence indicator (Fisher's combined probability test). 

When available, a control dataset (e.g., non-enriched 
sample or IP with non-specific IgG) is included during 
the binding site assessment. Indeed, whenever an enrich- 
ment event matches with the trained kernel, the kernel- 
fitting process is also performed in the control dataset 
for the corresponding genomic region. If in a given 
chromatin region both the enrichment and the control 
dataset comply with the trained kernel, the confidence 
of the identified binding site in the immunoprecipitated 
dataset is corrected as follows: 

Confidence Threshold (CT) = -log (p-value"') 
*Intensity'''-(-log(p-value^°""'°') * Intensity'^™'™') 

This approach enhances the confidence of the pre- 
dicted binding event by evaluating its uncertainty from 
different perspectives, namely relative to a local back- 
ground, relative to the identified patterns across the gen- 
ome (global background) and relative to the enrichment 
seen in the control sample. Note that the described con- 
trol sample-based confidence correction is based on the 
assumption that the compared datasets (IP and control) 
present comparable sequencing depth levels. It is im- 
portant to mention that some methodologies apply in 
case of divergent sequencing depths linear scaling cor- 
rections; however we have shown in a previous study 
that important differences in sequencing depths may 
give rise to non-linear differences between compared 
datasets [16]. 

In contrast to other methods, we do not suggest a de- 
fault p-value threshold but provide a comprehensive list 
of all identified binding sites (complying with the kernel 
fitting) and their associated confidence descriptors such 
that the user can chose the optimal confidence thresh- 
old. In fact, defining a default p-value threshold may be 
misleading for inexperienced users, who may consider 
such reference as a gold standard rather than evaluating 
by other approaches the degree of false calls for a given 
p-value. Instead, we propose a graphical approach for 
estimating p-value levels, which may preferentially be 
associated to background behavior (described in the 
MeDiChlSeq vignette; Additional file 4). 

MeDiChlSeq implementation and performance 

MeDiChlSeq has been implemented in R and is designed 
to operate by multicore processing to accommodate 
computation requirements during linear regression fit- 
ting and bootstrapping. 
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For users who are interested in evaluating sites identi- 
fied by another peak caller, MeDiChlSeq offers an option 
in which only defined regions can be deconvolved. The 
R package, vignette and manual of MeDiChlSeq are 
available as additional files (Additional files 4, 5 and 6). 
Note that these files can also be downloaded from our 
homepage http://igbmc.fr/Gronemeyer_MeDiChISeq. 

Results and discussion 

In this study the performance of MeDiChlSeq has been 
evaluated with a large number of publicly available 
ChlP-seq datasets. These include the TFs SRF, MAX, 
NRSF [17] and the sequence-specific insulator protein 
CTCF [18], all of which present sharp peaks in their 
ChlP-seq profiles. Moreover, also broad patterns charac- 
teristic of some histone modifications, such as histone 
H3 lysine 4 trimethylation (H3K4me3), and lysine 9 
(H3K9Ac) or lysine 27 (H3K27Ac) acetylation, were also 
included using published data sets [18]. Importantly, 
MeDiChlSeq performance was compared to three other 
peak callers, which are representatives of the different 
methodologies implemented over the years: MACS 
models the background according to a Poisson distribu- 
tion [7], BayesPeak takes advantage of a fully Bayesian 
hidden Markov model to identify binding events [10], 
and PeakRanger applies in addition to background mod- 
elling, in a second round a "summit-valley-alternator" 
algorithm to scan for significant summits [12]. The rele- 
vant parameters in which each peak caller has been used 
are provided in the Additional file 7. As illustrated in 
Figure lA, all four peak callers predict a variable number 
of significant peaks when default confidence threshold 
conditions are used (MACS: p-value < 1x10 ; Baye- 
sPeak: posterior probability or PP > 0.5; PeakRanger: 
p-value < 1x10 "*, FDR < 0.01; MeDiChlSeq: no confi- 
dence cutoff; instead the number of peaks complying 
with the kernel are given) suggesting a priori that default 
parameters may have to be optimized for comparative 
studies (see also [19]). In general, we observed that Peak- 
Ranger and MACS display a more conservative behav- 
iour than MeDiChlSeq and BayesPeak when comparing 
the total number of predicted peaks. Note that the num- 
ber of MeDiChlSeq sites corresponds to those comply- 
ing with the kernel fitting and have not been filtered by 
any other threshold criterion. Even more importantly, 
differences in the number of sites identified by each peak 
caller are observed also with biological replicates, which 
likely reflect inherent differences in the characteristics of 
each of such datasets. Note that in the present study we 
considered only ChlP-seq profiles that were published as 
biological replicates. 

To compare commonly identified sites we used the 
predicted peak summits ±50nt flanking sequences; as 
BayesPeak does not specify summits, the centre of the 



predicted peak base was used. This comparison revealed 
that MeDiChlSeq identified the majority of sites pre- 
dicted by the other methods (Figure IB, Additional 
file 8). Notably, when comparing the fraction of peaks 
shared among peak callers MeDiChlSeq performs best 
for both sharp and broad binding patterns (CTCF and 
H3K4me3), while most of the other peak callers present 
significantly lower fractions of shared peaks, as seen for 
H3K4me3 (Figure IB). This observation correlates with 
the high number of MeDiChlSeq-identified sites relative 
to the other peak callers resulting from the efficient de- 
convolution by MeDiChlSeq. In fact, as illustrated in 
Figure IC, MeDiChlSeq annotated 8 distinct loci of 
H3K4me3 chromatin modifications, where the other 
peak callers identified one, two or three sites. We noted 
that these differences in the deconvolution potential of 
peak callers were less pronounced for sharp binding pat- 
terns (Figure IC, left panel). 

MeDiChlSeq's sensitivity evaluated by their performance 
in reproducibility assays 

Figure 1 illustrates that a comparison of peak caller per- 
formances under default parameters is unsatisfactory. In 
fact, default confidence thresholds that are too relaxed 
will increase the amount of false positive calls, while too 
stringent conditions will produce false negatives. To cir- 
cumvent this problem, peak caller performance can be 
evaluated in the context of reproducibility assays by 
comparing binding site predictions for biological repli- 
cates and ranking them according to confidence descrip- 
tors. The underlying assumption is that true binding 
sites will be retrieved in both replicate datasets within a 
similar confidence ranking, while low confidence peaks, 
which are expected to contain also false positives, will 
show lower consistency in the reproducibility assay. The 
consistency between ranked peak confidence descriptors 
was previously formalized based on a copula mixture 
model, which estimates the probability that each pair of 
peaks is reproducible. This probability was described as 
"Irreproducibility Discovery Rate (IDR) [20] and has 
been used by the ENCODE consortium to identify a 
transition from signal to noise when peak caller binding 
site predictions were evaluated [21]. 

Here we have compared peak caller performance in 
the context of reproducibility across replicate ChlP-seq 
datasets. Importantly, MeDiChlSeq showed the highest 
number of reproducible peaks in CTCF and NRSF ChlP- 
seq datasets (Figure 2A). Also for broader patterns like 
H3K27ac and H3K4me3 MeDiChlSeq identified the 
highest number of reproducible peaks at acceptable IDR 
thresholds (e.g., 0.1 or 90% reproducible discovery). Note 
that the IDR progression curve for the histone modifica- 
tion mark H3K4me3 continues to increase rather slowly 
above this threshold, suggesting retrieval of an important 
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Figure 1 MeDiChlSeq performance evaluated In the context of several ChlP-seq datasets and relevant Peak calling algorithms. 

(A) MeDiChlSeq and three other peak callers (MACS, BayesPeak and PeakRanger) were used to identify binding events in ChlP-seq datasets for 
three TPs (SRF, MAX, NRSF), the sequence-specific insulator protein CTCF and two histone modification marks (H3K9Ac, H3K27Ac, H3K4me3). The 
default confidence threshold parameters described for each peak caller were applied to assess the number of peaks per dataset. Note that for 
each ChlP-seq two biological replicates were processed. (B) Peaks commonly identified by two of the indicated peak callers for two replicates of 
CTCF (top panel) and H3K4me3 (bottom panel) are displayed as percentages of the total sites found by a given method (indicated at the right). 
(C) Representative genome browser screenshots illustrating the ability to deconvolve binding/modification patterns of peak callers. Note that 
most of the peak callers identify a similar number of "sharp" binding events for CTCF, while MeDiChlSeq has the highest potential of deconvolution for 
the H3K4me3 pattern. 



number of irreproducible events among the significant 
top-ranked peaks in the repUcate dataset. That the other 
peak callers identify less than 5,000 H3K4me3 peaks 
with IDR levels below 10% supports the view that for 
broader binding patterns the assessment of IDRs by ap- 
plying standard approaches becomes suboptimal [20]. 

An important limitation of the above analysis is the 
potential variability between compared biological repli- 
cate datasets, as technical differences between the 
compared profiles may exist (e.g., sequencing depth 
differences; Peak caller deconvolution performance 



for broad patterns). To circumvent this limitation, we 
treated the predictions of two peak callers as "virtual 
replicates" for IDR analyses for a number of individual 
ChlP-seq datasets (Figure 2B). We thus ask if two peak 
callers identify binding events/marks in the same ChlP- 
seq dataset with similar confidence (i.e., if a top ranked 
peak of peak caller A is also top ranked by peak caller 
B). This novel type of comparison demonstrated that 
MeDiChlSeq identifies higher numbers of reproducible 
peaks when compared with other methods. In fact, 
in the case of CTCF datasets, MeDiChlSeq-MACS 
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Figure 2 Irreproducibillty Discovery Rate (IDR) assays to compare peak calling algorithms. (A) IDR assay comparing biological replicate 
datasets (see text for details). Note that for H3K4me3, IVleDiChlSeq continues to find significant common events in compared replicate datasets 
with slowly increasing IDR while the IDRs sharply increase for the three other peak callers around 5,000 significant peaks commonly identified in 
the replicates. (B) Similar reproducibility analysis but performed by pairwise comparison of binding site predictions by the different peak callers 
indicated at the left ("virtual replicates"). This approach reflects the concordance in binding site identification between two peak callers. Note that 
in all illustrated IDR assays, IVleDiChlSeq predictions have the lowest IDR levels for the highest number of significant binding sites. In (A) and (B) 
dashed lines indicate IDR levels of 0.1; i.e. a reproducibility level of 90%. 



performed best for the first replicate, while MeDiChl- 
Seq-PeakRanger won in the case of the second replicate 
(Figure 2B). Importantly, evaluation of H3K27ac and 
H3K4me3 ChlP-seq datasets by this approach revealed 
large differences in reproducibility performance of the 
peak callers. PeakRanger and BayesPeak systematically 
performed worst, while MeDiChlSeq versus any other 
peak caller gave the best scores in either biological repli- 
cate. Note that the particular IDR patterns observed for 
H3K4me3 in an inter-replicate comparison (right panel 
in Figure 2A) was not seen when the inter-peak caller 
performance for each replicate dataset was compared 
(right panels in Figure 2B), suggesting that it results 
from significant divergence between the "biological 
replicate datasets". Overall these analyses showed that 
MeDiChlSeq systematically identified the most reprodu- 
cible events among biological replicates and peak caller 
annotations, thereby revealing the high sensitivity and 
reliability of this approach. 



MeDIChlSeq's specificity in the context of curated 
benchmark datasets 

In addition to identifying the highest number of true 
binding events (sensitivity), a good peak caller algorithm 
is expected to produce the lowest amounts of false posi- 
tives (specificity). As indicated above, IDR studies are 
expected to identify a transition from signal to noise 
when evaluating peak callers' binding sites reproducibil- 
ity. In this manner, the highest number of significant 
binding sites at the lowest IDR, as observed in the case 
of MeDiChlSeq performance (Figure 2), reflects a high 
degree of sensitivity and specificity, at least in the con- 
text of reproducible binding site discovery in biological 
replicates or when comparing different peak callers' per- 
formance per ChlP-seq dataset. 

Previous studies have evaluated peak caller perform- 
ance to distinguish false positives from "true" binding 
sites by using a manually curated collection of binding 
regions (and "false" enrichment sites) that cover typical 
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variation in peak size and appearance for the TFs NRSF, 
SRF and MAX [14,17]. We have evaluated MeDiChlSeq 
in the context of this benchmark, demonstrating in all 
three cases a high percentage of true binding site recov- 
ery (> 80%) and low false discovery rate (Figure 3). It is 
worth mentioning that, while its overall FDR perform- 
ance is similar to that of MACS, MeDiChlSeq generally 
retrieves more true binding sites. Furthermore, while 



using a background control dataset affected the false dis- 
covery rate of all other evaluated peak callers, MeDiChl- 
Seq performed equally well in identifying true binding 
events in presence and absence of this control. This is 
most likely due to the fact that the binding site identifi- 
cation relies on a pre-computed kernel and is thus less 
affected by artifactual enrichment events. This perform- 
ance is well illustrated in the case of NRSF datasets. 
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Figure 3 Specificity and sensitivity of MeDiChlSeq peak predictions compared with other algorithms. A manually curated ChlP-seq bench- 
mark dataset for MAX, SRF and NRSF [17] was used to assess the percentage of true site recovery by the indicated peak calling algorithms relative 
to the false discovery rate (FDR). For the two upper panels no background control sample was used during peak calling. The two lower panels 
show the same analysis but with considering the corresponding background control dataset in the analysis. Note that PeakRanger performance is 
not illustrated in the upper panels, as this algorithm does not perform IP binding site assessment without background control. In cases where the 
tracings overlap, an arrow indicates the point of maximal recovery. 
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where in the absence of background control dataset, 
MACS and MeDiChlSeq present a maximal percent of 
recovery of 90% but accompanied by high FDR levels 
(>0.5). Importantly, the incorporation of background 
control dataset in the analysis reduces the FDR levels 
but the percentage of true site recovery is also compro- 
mised for MACS (less than 80%), while MeDiChlSeq 
manages to keep the percentage of recovery levels up to 
90% even under these conditions (FDR < 0.4). 

Peak caller performance relative to sequencing depth 

The rapid technological progress in the field of massive 
parallel sequencing provided over the years sequencing 
platforms with continuously increasing sequencing 
depths. In fact, while the first versions of the Illumina 
Genome analyzer had sequencing capacities in the range 
of several millions reads, the latest Hiseq2000/2500 ver- 
sions provides more than 3 billion reads per flow cell. 
Following this continuous progress, the number of se- 
quenced reads used per ChlP-seq assay has increased 
considerably. In fact, while early ChlP-seq assays gener- 
ated <4 million total mapped reads (TMRs), current 
datasets comprise >20 million TMRs. Importantly, in- 
creasing the sequencing depth increased also the num- 
ber of discovered binding sites [5,22,23] but only a few 



studies evaluated the peak caller performance for condi- 
tions with varying sequencing depths. Obviously, in- 
creasing the sequencing depth will increase both the 
signal and the noise levels, which could potentially affect 
peak caller performance. 

To address this question, we created a high density 
ChlP-seq dataset by combining the datasets of the 
two biological CTCF replicates. This weta-profile 
comprised >36 M TMRs and was used for profile 
reconstruction from subsets generated by random 
sampling (80%; 60%; 40%; 20%; 10%; 5%) (Additional 
file 9A). To perform IDR evaluation, pseudo replicates 
were produced by two independent random samplings. 
As expected, the CTCF profile reconstructed from <2 M 
TMRs had unacceptably high IDR levels (Figure 4). In 
this condition MeDiChlSeq and PeakRanger performed 
worst, followed by MACS and BayesPeak. This is readily 
explained by the fact that both MeDiChlSeq and Peak- 
Ranger evaluate peak shapes, which are highly variable 
at low TMR levels (see pseudo replicates at 1.8 M TMRs 
in Additional file 9A). Importantly, with the increase in 
the TMR levels peak shapes consolidate and the 
performance of MeDiChlSeq is enhanced accordingly 
(Figure 4A). Indeed, above 14 M TMRs MeDiChlSeq out- 
performs all other peak callers with respect to the number 
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Figure 4 Performance of peal< calling algorithms at different sequencing depth. (A) Pseudo-replicates with different total mapped reads 
(TMRs) were created from a CTCF dataset of 36,383,621 reads. Mapped reads were twice randomly sampled to obtain fractions of 5 to 80% of 
the original dataset, as indicated in the panels. IDR assays comparing the pseudo-replicate datasets were performed for the predictions of the four 
peak-calling algorithms. (B) The number of reproducible peaks identified for an IDR threshold of 10% (IDR < 0.1) is illustrated relative to the 
sequencing depth. (C) Motif analysis performed with the reproducible binding sites specific to MeDiChlSeq (when compared with MACS) 
corresponding to 29,1 06,897 TMRs. As illustrated, more than 40% of these sites harbor a CTCF motif (top panel; Jaspar database comparison 



performed by CentriMO; p-value 4.4x10 



in the center of the predicted peaks (bottom panel). 
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of significant peaks with the lowest IDR levels. Note that 
above this TMR level all peak callers tend to reach satur- 
ation, a phenomenon generally referred to as the sequen- 
cing depth beyond which the number of newly discovered 
sites (in this case in a reproducible manner) reach a con- 
stant level (i.e. between 30 to 40 thousand sites for IDR 
levels lower than 0.1; Figure 4B). 

This comparative study clearly demonstrates a direct 
correlation between TMR size and the degree of repro- 
ducible peak identification by any of the compared peak 
callers. In addition, it shows that MeDiChlSeq, and to a 
certain degree also BayesPeak, tend to retrieve more re- 
producible binding sites than MACS and PeakRanger. 
This could be the direct consequence of different sensi- 
tivities and/or specificities of the peak callers under 
these conditions. To assess this issue we evaluated the ex- 
tent of overlap between the retrieved sites by the different 
methods relative to MeDiChlSeq. This analysis demon- 
strated that MeDiChlSeq retrieved >88% of the binding 
sites identified by the other methods, but predicted an 
additional > 10,000 specific sites (Additional file 9B). 

These additional sites may originate from the use of the 
stringent comparative conditions (summit overlap +/-50nts 
distance). Indeed, a comparison between MACS and 
MeDiChlSeq revealed that > 4,000 of the 15,000 
MeDiChlSeq-specific sites overlapped with MACS 
calls, when peak comparison settings were relaxed 
(Additional file 10). The remaining 10,000 binding sites 
that did not appear in the MACS-predicted site list 
were further evaluated for the presence of the previ- 
ously described CTCF motif. Importantly, more than 
4,000 MeDiChlSeq-specific sites (40%) contained a 
CTCF motif, strongly suggesting that this population 
corresponds to bona fide CTCF binding sites that 
were ignored by MACS (Figure 4C). Of note, Peak- 
Ranger and/or BayesPeak identified nearly 3,000 of 
these bona fide CTCF binding sites (illustrated in 
Additional file IOC, right panel). 

Conclusions 

Here we present MeDiChlSeq, a model-based deconvo- 
lution approach to assess binding events and chromatin 
marks from ChlP-seq datasets. We have previously used 
an early version of this methodology for mapping the 
chromatin localization of RXRa and RARy nuclear re- 
ceptors [24], as well as for profiling RNA polymerase II 
[16]. This report describes the implementation of MeDi- 
Chl - originally developed by David Reiss to evaluate 
ChlP-chip profiles [15] - for the analysis of datasets 
generated by massive parallel sequencing. 

From the conceptual point of view, this methodology 
applies a different rationale to define an enrichment 
event. In contrast to other peak detection algorithms, 
MeDiChlSeq uses the binding pattern properties, inhe- 



rent to the ChlP-seq profile under study, to define en- 
richment and background characteristics. Albeit other 
shape-based methodologies for binding site identifica- 
tion exist (e.g. Triform [14]; T-PIC [13]), MeDiChlSeq 
presents further conceptual advantages originating from 
the training step that defines a "consensus" binding pat- 
tern, which is then used to identify significant binding 
events at genome-wide level. While a direct comparison 
of the various shape-based methodologies would be of 
interest, these tools were not operative/available when 
we performed this study. 

The comparative analysis of MACS, BayesPeak and 
PeakRanger performance revealed that MeDiChlSeq 
identifies most of the sites predicted by other methods, 
but in addition it discovers new significant binding 
events/marks with a low level of false calls. We thus 
conclude that the incorporation of a more complex fea- 
ture to define the relevance of an enrichment event, i.e. 
the evaluation of its shape defined by a preliminary 
training process, is a major advantage for the peak call- 
ing process. While MeDiChlSeq has shown also optimal 
performance when identifying binding patterns in his- 
tone modification marks like H3K4me3 or H3K27ac, 
which present broader enrichment patterns than tran- 
scription factors, we did not perform exhaustive analyses 
on even broader pattern profiles like those observed for 
H3K36me3 or H3K27me3, because the current MeDi- 
ChlSeq release does not include enrichment island iden- 
tification, as is the case for other tools like SICER [25], 
RSEG [26] or BroadPeak [27]. Nevertheless, the present 
release of MeDiChlSeq is already able to perform opti- 
mal binding site identification also for rather broad 
enriched patterns, such as the H3K36me3 histone mark 
(Additional file 11). Importantly, such multiple site iden- 
tification recapitulates the enrichment island patterns 
identified by SICER, strongly suggesting that also MeDi- 
ChlSeq performs well in such situations. In this context, 
a further optional computational module that merges 
closely annotated binding/modification sites is being 
developed to use MeDiChlSeq outputs for enrichment 
island prediction. 
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Additional file 3: Characterising ChlP-seq Binding Patterns by 
Model-Based Peak Shape Deconvolution. 

Additional file 4: MeDiChlSeq vignette. 

Additional file 5: MeDiChlSeq manual. 

Additional file 6: MeDiChlSeq R package. 



Mendoza-Parra et al. BMC Genomics 2013, 14:834 
http://www.biomedcentral.com/1471-2164/14/834 



Page 10 of 10 



Additional file 7: Peak caller characteristics and their relevant 
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calling algorithms. 
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by other peak caller approaches. 
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